
Pathology Image


A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction

Zhou, Huajun, Zhou, Fengtao, Ma, Jiabo, Xu, Yingxue, Wang, Xi, Zhang, Xiuming, Liang, Li, Li, Zhenhui, Chen, Hao

arXiv.org Artificial Intelligence

Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data, extracting representations that generalize poorly. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, with C-index improvements of 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize therapies and improve treatment outcomes.
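The C-index (concordance index) reported above measures how often a model's predicted risk ordering agrees with patients' observed survival ordering. As an illustrative sketch, not part of MICE itself, Harrell's C-index on right-censored data can be computed pairwise:

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable patient pairs whose
    predicted risk ordering agrees with their observed survival ordering.
    times: observed times; events: 1 = event, 0 = censored; risks: model scores."""
    concordant, ties, comparable = 0.0, 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so that i has the earlier observed time
        if times[j] < times[i]:
            i, j = j, i
        # a pair is comparable only if the earlier time is an observed event;
        # tied times are skipped here (a common simplification)
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if risks[i] > risks[j]:      # earlier event, higher predicted risk
            concordant += 1
        elif risks[i] == risks[j]:   # tied risks count half
            ties += 0.5
    return (concordant + ties) / comparable
```

A perfect model scores 1.0, random guessing about 0.5, so the reported 3.8-11.2% gains are absolute improvements on this scale.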


Segment Anything in Pathology Images with Natural Language

Chen, Zhixuan, Hou, Junlin, Lin, Liqi, Wang, Yihui, Bie, Yequan, Wang, Xi, Zhou, Yanning, Chan, Ronald Cheong Kin, Chen, Hao

arXiv.org Artificial Intelligence

Current segmentation methods encounter significant challenges in clinical applications, primarily due to the scarcity of high-quality, large-scale annotated pathology data and the constraints of fixed, narrowly defined object categories. To address these issues, this work aims to develop a segmentation foundation model capable of segmenting anything in pathology images using natural language. First, we establish PathSeg, the largest and most comprehensive dataset for pathology image semantic segmentation, derived from 21 publicly available datasets and comprising 275k image-mask-label triples. PathSeg features 160 segmentation categories organized in a three-level hierarchy covering 20 anatomical regions, 3 histological structures, and 61 object types. Next, we introduce PathSegmentor, a text-prompted foundation model tailored for pathology image segmentation. With PathSegmentor, users can achieve semantic segmentation simply by providing a descriptive text prompt for the target category, eliminating the need to laboriously provide numerous spatial prompts (e.g., boxes or points) for each instance. Extensive experiments on both internal and external datasets demonstrate the superior segmentation performance of PathSegmentor: it outperforms specialized models while handling a broader range of segmentation categories and maintaining a more compact model size.


PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning

Zhou, Junjie, Zuo, Yingli, Feng, Shichang, Wan, Peng, Zhu, Qi, Zhang, Daoqiang, Shao, Wei

arXiv.org Artificial Intelligence

With the development of generative artificial intelligence and instruction tuning techniques, multimodal large language models (MLLMs) have made impressive progress on general reasoning tasks. Benefiting from the chain-of-thought (CoT) methodology, MLLMs can solve visual reasoning problems step-by-step. However, existing MLLMs still face significant challenges when applied to pathology visual reasoning tasks: (1) they often underperform because they lack domain-specific information, which can lead to model hallucinations; (2) the additional reasoning steps in CoT may introduce errors, leading to divergent answers. To address these limitations, we propose PathCoT, a novel zero-shot CoT prompting method that integrates pathology expert knowledge into the reasoning process of MLLMs and incorporates self-evaluation to mitigate answer divergence. Specifically, PathCoT guides the MLLM with prior knowledge to act as a pathology expert and provide a comprehensive, domain-informed analysis of the image. By incorporating the experts' knowledge, PathCoT obtains answers through CoT reasoning. Furthermore, PathCoT incorporates a self-evaluation step that assesses both the results generated directly by the MLLM and those derived through CoT, finally determining the reliable answer. Experimental results on the PathMMU dataset demonstrate the effectiveness of our method on pathology visual understanding and reasoning.


Learned Image Compression and Restoration for Digital Pathology

Lee, SeonYeong, Seong, EonSeung, Lee, DongEon, Lee, SiYeoul, Cho, Yubin, Park, Chunsu, Kim, Seonho, Seo, MinKyung, Ko, YoungSin, Kim, MinWoo

arXiv.org Artificial Intelligence

Preprint, compiled April 2, 2025. Author affiliations: Department of Information Convergence Engineering and School of Biomedical Convergence Engineering, Pusan National University, Yangsan, Korea; Seegene Medical Foundation, Seoul, Korea. The first two authors contributed equally to this work.

Abstract: Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems.

Keywords: Learned Image Compression, Deep Learning, Wavelet Transform, Digital Pathology, Whole Slide Image.

1 Introduction. Digital pathology images serve as fundamental data for various medical applications, playing a crucial role in cancer diagnosis, disease analysis, and treatment planning. These images are typically stored as Whole Slide Images (WSIs), which are characterized by ultra-high resolution (typically 0.25 µm/px). A single uncompressed WSI can often exceed several gigabytes in size (e.g., 20-30 GB per image), posing significant challenges in terms of storage, transmission, and computational efficiency.
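The lifting-scheme transform at the heart of CLERIC's analysis stage splits a signal into low- and high-frequency halves through split, predict, and update steps, and is exactly invertible. A minimal 1-D Haar lifting sketch (CLERIC's actual predict/update operators are learned convolutions, not the fixed filters used here):

```python
def lifting_forward(x):
    """One level of the Haar lifting scheme: split -> predict -> update.
    Returns the low-frequency (approximation) and high-frequency (detail) halves."""
    even, odd = x[0::2], x[1::2]                          # split into two phases
    detail = [o - e for o, e in zip(odd, even)]           # predict odd from even
    approx = [e + d / 2 for e, d in zip(even, detail)]    # update to preserve the mean
    return approx, detail

def lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order for exact reconstruction."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [e + d for e, d in zip(even, detail)]
    x = []
    for e, o in zip(even, odd):                           # interleave phases back
        x.extend([e, o])
    return x
```

Applying the forward transform along rows and then columns extends this to 2-D images; the exact invertibility of the lifting structure is what lets the synthesis stage restore fine tissue detail.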


M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

Wang, Hongyi, Du, Xiuju, Liu, Jing, Ouyang, Shuyi, Chen, Yen-Wei, Lin, Lanfen

arXiv.org Artificial Intelligence

The advancement of Spatial Transcriptomics (ST) has facilitated spatially-aware profiling of gene expression based on histopathology images. Although ST data offers valuable insights into the tumor micro-environment, its acquisition remains expensive. Therefore, directly predicting ST expression from digital pathology images is desirable. Current methods usually adopt existing regression backbones with patch sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that accommodates the hierarchical structure of pathology images via a decoupled multi-scale feature extractor. Unlike traditional models trained on one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon this many-to-one scheme, M2OST can be easily scaled to different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We tested M2OST on three public ST datasets, and the experimental results show that it achieves state-of-the-art performance with fewer parameters and floating-point operations (FLOPs).


Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Naeem, Awais, Li, Tianhao, Liao, Huang-Ru, Xu, Jiawei, Mathew, Aby M., Zhu, Zehao, Tan, Zhen, Jaiswal, Ajay Kumar, Salibian, Raffi A., Hu, Ziniu, Chen, Tianlong, Ding, Ying

arXiv.org Artificial Intelligence

Accurate diagnosis and prognosis assisted by pathology images are essential for cancer treatment selection and planning. Despite the recent trend of adopting deep-learning approaches for analyzing complex pathology images, they fall short because they often overlook the domain-expert understanding of tissue structure and cell composition. In this work, we focus on the challenging Open-ended Pathology VQA (PathVQA-Open) task and propose a novel framework named Path-RAG, which leverages HistoCartography to retrieve relevant domain knowledge from pathology images and significantly improves performance on PathVQA-Open. Acknowledging the complexity of pathology image analysis, Path-RAG adopts a human-centered AI approach, using HistoCartography to select the relevant patches from pathology images. Our experiments suggest that domain guidance can significantly boost the accuracy of LLaVA-Med from 38% to 47%, with a notable gain of 28% for H&E-stained pathology images in the PathVQA-Open dataset. For longer-form question and answer pairs, our model consistently achieves significant improvements of 32.5% on ARCH-Open PubMed and 30.6% on ARCH-Open Books for H&E images. Our code and dataset are available here (https://github.com/embedded-robotics/path-rag).


PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding

Dai, Dawei, Zhang, Yuanhui, Xu, Long, Yang, Qianlan, Shen, Xiaojing, Xia, Shuyin, Wang, Guoyin

arXiv.org Artificial Intelligence

Previous advancements in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning public medical image-text data for domain-specific alignment; (2) using this image-text data, we train a pathology language-image pretraining (PLIP) model as the specialized visual encoder for pathology images, and develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PA-LLaVA: the first stage for domain alignment, and the second for end-to-end visual question answering (VQA). In experiments, we evaluate PA-LLaVA on both supervised and zero-shot VQA datasets; our model achieved the best overall performance among multimodal models of similar scale. Ablation experiments also confirmed the effectiveness of our design. We posit that our PA-LLaVA model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA


Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

Chen, Wenting, Liu, Jie, Chow, Tommy W. S., Yuan, Yixuan

arXiv.org Artificial Intelligence

Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recovering each pixel, leading to sub-optimal recovery given the large variation among pathology images. In this paper, we propose the first hierarchical reinforcement learning framework, Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), to address these issues in pathology image super-resolution. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt a hierarchical recovery mechanism at the patch level to avoid sub-optimal recovery. Specifically, a higher-level spatial manager picks out the most corrupted patch for the lower-level patch worker, and a higher-level temporal manager evaluates the selected patch and determines whether optimization should stop early, thereby avoiding over-processing. Under the guidance of the spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL improves tumor diagnosis by a large margin and generalizes across various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.


Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction

Wang, Hongxiao, Yang, Yang, Zhao, Zhuo, Gu, Pengfei, Sapkota, Nishchal, Chen, Danny Z.

arXiv.org Artificial Intelligence

For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomics (e.g., bulk RNA-seq) for quantifying gene expression. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance between modalities, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during training, ensuring that both modalities are sufficiently trained. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
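The Cox partial likelihood loss mentioned above scores risk predictions by how strongly each observed event stands out within its risk set. A minimal sketch of the negative log partial likelihood (Breslow form, with no tie correction; Path-GPTOmic's gradient modulation would act on the gradients this loss produces for each modality, which is not shown here):

```python
import math

def cox_partial_likelihood_loss(risks, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.
    risks: predicted log hazard ratios (higher = higher risk);
    times: observed times; events: 1 = event observed, 0 = censored."""
    order = sorted(range(len(times)), key=lambda i: times[i])  # ascending time
    loss, n_events = 0.0, 0
    for idx, i in enumerate(order):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        # risk set: all subjects still at risk when subject i's event occurs
        risk_set = order[idx:]
        log_denom = math.log(sum(math.exp(risks[j]) for j in risk_set))
        loss += log_denom - risks[i]
        n_events += 1
    return loss / n_events
```

Minimizing this loss pushes each event's risk score above those of subjects who survived longer, which is also what the C-index rewards at evaluation time.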


New AI tech aims to detect the origin of cancers for optimal treatments: 'An important step'

FOX News

Dr. Marc Siegel discusses the pros and cons of using AI in health care, and how it's too early to decide whether it's entirely reliable, on 'Fox News Tonight.' For a small percentage of cancer patients, doctors are unable to determine where in the body the disease originated. To help pinpoint the origin of these cancers of unknown primary (CUP), researchers at the Massachusetts Institute of Technology (MIT) have created an artificial intelligence model that analyzes the patient's genetic information and predicts where the tumor first appeared. When using the new AI model on 900 patients with cancers of unknown origin, researchers found that they could accurately classify at least 40% of tumors, according to a study published in Nature Medicine.